ECE 5760 Final Project: Stockfish NNUE on DE-1 SoC

by Sanath Kumar (sk2794), Pawan Perera (pp497), Chris Yang (cy524)

Introduction

For our final project, we decided to implement the Stockfish NNUE on the DE-1 SoC FPGA. Stockfish is a powerful open-source chess engine that evaluates the advantage for white or black and computes optimal moves based on the positions of the pieces on the board. Stockfish uses a type of neural network called an NNUE (Efficiently Updatable Neural Network), which computes an evaluation score from the state of the chess board. As the name suggests, the NNUE is designed to be updated efficiently and evaluated extremely quickly: because only a few pieces move between subsequent evaluations, only a small fraction of the network inputs change, so most of the first layer's accumulated sums can be reused. Given the size of the inputs and the memory required, we decided to pursue the implementation of this unique neural network as a hardware-acceleration project on the FPGA.

High-Level Design

Rationale and Sources of Project Idea

When discussing ideas for this project, we agreed that the common theme across our labs was running a computationally intensive algorithm on the FPGA to showcase its capability for hardware acceleration. After playing a few games of chess on chess.com, we discovered the fascinating Stockfish engine that drives the site's analysis, and thought that implementing a neural network in Verilog would similarly showcase the FPGA's power in quickly processing substantial amounts of data.

Background Math and Logical Structure

To understand the whole neural network algorithm, it is first important to understand how inputs are formatted into feature sets. There are many types of feature sets; the most common is HalfKP, which is the one we decided to implement. HalfKP features are tuples of (our_king_square, piece_square, piece_type, piece_color): each feature encodes one piece's location, type, and color relative to one side's king. With 64 possible our_king_square locations, 64 possible piece_square positions, 5 possible piece_types (queen, rook, bishop, knight, pawn), and 2 piece_colors (black and white), there are 64*64*5*2 = 40960 features per side. Of these, at most 30 are active at once, one for each piece on the board other than the two kings. The NNUE input stores a 1 for every active feature and a 0 for every inactive feature, as shown in the example in Figures 1 & 2.

Figure 1: Chess Board Example

Figure 2: Chess Board Example

Here we can see that from white's perspective, the active features are (A1, C3, pawn, white) and (A1, D4, rook, black). From black's perspective, however, they are (B1, C6, pawn, black) and (B1, D5, rook, white). This is because each side's perspective sees itself as white and the opponent as black. Think of it as if the board were flipped horizontally, the colors swapped, and the letters and numbers denoting each position kept the same.
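As a concrete illustration, the sketch below (our own C, with hypothetical names, not the project code) builds both perspectives' tuples for a single non-king piece, using the 63 - square mirroring described later in the hardware design; the king-square handling here is our assumption.

    /* Illustrative sketch (hypothetical names): build the two HalfKP tuples
       generated by one non-king piece. Each perspective references its own
       king; black's coordinates are mirrored with 63 - square. */
    typedef struct { int king_sq, piece_sq, piece_type, piece_color; } feature_t;

    void piece_features(int white_king_sq, int black_king_sq,
                        int piece_sq, int piece_type, int piece_is_white,
                        feature_t *white_view, feature_t *black_view)
    {
        white_view->king_sq     = white_king_sq;
        white_view->piece_sq    = piece_sq;
        white_view->piece_type  = piece_type;             /* 0..4, kings excluded */
        white_view->piece_color = piece_is_white ? 0 : 1; /* own side reads as 0  */

        black_view->king_sq     = 63 - black_king_sq;     /* mirrored frame       */
        black_view->piece_sq    = 63 - piece_sq;
        black_view->piece_type  = piece_type;
        black_view->piece_color = piece_is_white ? 1 : 0; /* colors swap          */
    }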

Now that we understand how the input is formatted, Figures 3 and 4 below illustrate the model architecture of an NNUE with the HalfKP feature set. 

Figure 3: HalfKP NNUE Model Architecture [chessprogramming.com]

Figure 4: Simplified HalfKP NNUE Model Architecture

Note that Figure 3 shows a total of 41024 feature sets, since it incorporates an extra piece type; as stated before, our model has 40960 input features per side. Regardless, the model architecture is the same. Each of the two 40960-element input vectors is fed into a linear layer that multiplies it by a 256 x 40960 matrix of 16-bit weights and adds a 256-element vector of 16-bit biases, producing a 256-element vector of 16-bit numbers. The NNUE then truncates every number in the vector to 8 bits through a clipped ReLU function. Next, the two 256-element vectors are concatenated into a 512-element vector of 8-bit ints. Linear Layer 2 takes in this 512-element vector, multiplies it by a 32 x 512 matrix of 8-bit weights, adds a 32-element vector of 16-bit biases, and clips the result to output a 32-element vector of 8-bit numbers. Similarly, Linear Layer 3 takes in the 32-element vector, multiplies it by a 32 x 32 matrix of 8-bit weights, adds a 32-element vector of 16-bit biases, and again clips to output a 32-element vector of 8-bit numbers. The last linear layer takes in the 32-element output of Linear Layer 3, multiplies it by a 1 x 32 matrix of 8-bit weights, and adds a single bias to produce a scalar, which is then scaled to produce the NNUE evaluation score.
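The smaller layers can be summarized in a short behavioral C sketch of the forward pass (our own illustration under the bit widths above, not the project's Verilog or HPS code; the clipping range [0, 127] is an assumption):

    #include <stdint.h>

    static int8_t crelu(int32_t x) {          /* clipped ReLU, truncate to 8 bits */
        return (int8_t)(x < 0 ? 0 : (x > 127 ? 127 : x));
    }

    /* Generic linear layer: out[r] = crelu(sum_c w[r][c] * in[c] + b[r]). */
    static void linear(int rows, int cols, const int8_t *w,
                       const int16_t *b, const int8_t *in, int8_t *out) {
        for (int r = 0; r < rows; r++) {
            int32_t sum = b[r];
            for (int c = 0; c < cols; c++)
                sum += (int32_t)w[r * cols + c] * in[c];
            out[r] = crelu(sum);
        }
    }

    /* Chaining (the two 40960->256 transforms, with 16-bit weights and
       one-hot inputs, are handled separately in hardware):
         int8_t l1[512], l2[32], l3[32];
         linear(32, 512, w2, b2, l1, l2);   // Linear Layer 2
         linear(32,  32, w3, b3, l2, l3);   // Linear Layer 3
         // the final 1x32 layer produces the scalar evaluation score */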

Our design uses the HPS on the DE-1 SoC to store a 2D array of weights for each linear layer: the two 256 x 40960 weight matrices for the first layer, the 32 x 512 matrix for the second, the 32 x 32 matrix for the third, and a one-dimensional 32-element array for the final output layer. The HPS also stores each layer's biases in its own array. The FPGA is responsible for propagating through all the linear layers. Because we cannot send a whole 2D array directly from the HPS to the FPGA, we send one row at a time. We created five dual-port M10Ks shared between the HPS and FPGA so that the HPS can load one row of weights while the FPGA's linear layer accumulates the sum of the weights multiplied by the input vector for that row. The two sides are synchronized through two PIO ports, "trigger" and "ready". The FPGA sends "trigger" to the HPS to request a new row of weights, so each layer sends "trigger" Y times, where Y is the number of rows in the weight matrix. When the HPS receives the trigger, it loads the dual-port M10K with a new row of weights; once loading is complete, it asserts "ready" to indicate that the new row is in place. When the FPGA sees the ready signal, it continues its computation on that row. A summary of this handshake is shown in Figure 5 below.

Figure 5: Schematic of Loading Weights between HPS and FPGA

Hardware Design

Figure 6: Hardware Design Modules

Feature Set M10K Addressing

We first implemented two M10K blocks of 40960 entries each to store the feature sets of the two sides. To index into each feature set, we used the following formula, provided by the developers of HalfKP in NNUE.MD:

Figure 7: Feature Index Calculation, NNUE.MD

To implement this on both sides, we created signals preindex_white, piece_color_white, piece_square_white, king_square_white, and index_white for the white side's perspective, and preindex_black, piece_color_black, piece_square_black, king_square_black, and index_black for the black side's perspective. We then used combinational assign statements to perform the index calculations for both perspectives, applying the formula above with the multiplications converted to bit shifts, as seen in Figure 8 below.

Figure 8: Feature Index Calculation Implemented in Verilog 

This formula allows for exactly 40960 unique indices into our input M10K, one for each combination of the variables in the feature tuple. When updating a piece as an input to the NNUE, each piece must be updated in two feature sets: one for white's (own) perspective and one for black's (opponent's) perspective. In our piece-loading FSM, updating one piece does both automatically. We accomplished this by only updating the piece_color_white and piece_square_white variables: we created a unique indexing scheme for board locations so that updating piece_square_white also translates directly into the corresponding piece_square_black.
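For illustration, a C version of the index calculation, assuming the standard HalfKP formula from nnue.md (p_idx = piece_type * 2 + piece_color; index = piece_square + (p_idx + king_square * 10) * 64), with the multiplications reduced to shifts as in our Verilog:

    /* HalfKP index, assuming the standard formula from nnue.md.
       Yields exactly 40960 unique indices (0..40959). */
    int halfkp_index(int king_square, int piece_square,
                     int piece_type, int piece_color)
    {
        int p_idx = (piece_type << 1) + piece_color;          /* *2  -> << 1   */
        int k10   = (king_square << 3) + (king_square << 1);  /* *10 -> shifts */
        return piece_square + ((p_idx + k10) << 6);           /* *64 -> << 6   */
    }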

Figure 9: Piece_square and king_square addressing

Figure 10: Piece_color and piece_square translation between both sides

As shown in Figures 9 and 10 above, the board is uniquely indexed so that piece_square_black is simply 63 - piece_square_white. For example, index 0 in white's perspective is index 63 in black's, index 63 in white's is index 0 in black's, and index 20 in white's is index 43 in black's.

Linear Layer Modules

Figure 11: Linear Layer Computation [9]

The computation of each linear layer is simple. As seen in Figure 11 above, each output element, denoted a1, a2, a3, ..., ai, is the sum of each input element multiplied by the corresponding weight in that output's row of the weight matrix. To compute this, the linear layer needs an FSM that steps through each row of the output; for each output row, the FSM steps through every element of the input vector, multiplies it by the corresponding weight in that row, and keeps a running sum, which is finally added to that row's bias. The FSM diagram below shows how our linear layer modules accomplish this.

Figure 12: State Machine Diagram of Linear Layer Modules

Each linear layer module waits for a reset signal from the previous linear layer, except for the first, which gets its reset signal when the piece positions finish loading. When a linear layer is reset, it raises the "trigger" PIO signal, telling the HPS to load a new row of weights into the shared M10K. Once the HPS finishes, it asserts the "ready" signal to indicate that a new row of weights has been loaded. When the linear layer module in the FPGA receives this ready signal, it enters State 1, which sets "trigger" back down to low. The FSM then moves to State 2 on the next cycle and cycles continuously between States 2 and 3, reading a new weight from each column and accumulating the sum of each weight multiplied by the corresponding element of the input. In State 2, the FSM sets read_addr to the column-counter value to obtain read_data from the input data M10K and read_data_weights from the weights M10K at that column address. With a signed_mult module instantiated in each linear layer, the signal temp_mult returns the product of the input data and the weight. In State 3, temp_mult is added to temp_sum, and the FSM loops back to State 2 until the column counter reaches its maximum value. Once a whole row is complete, the FSM moves to State 4, reads the bias, adds it to the running sum, clips the sum through the ReLU function, and writes the result to the corresponding element of the output vector. This repeats for each row, transitioning back to State 1, until the row counter reaches the final row and the last element is written to the output vector. At that point the linear layer is finished and sends the "proceed" signal to the next linear layer to start its computation.
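A behavioral C model of one pass through this FSM follows (a sketch of the control flow with signal names from the text, not a translation of the Verilog; the clipping range is again an assumption):

    #include <stdint.h>

    /* Behavioral model of one linear-layer pass (rows x cols weights). */
    void linear_layer_fsm(int rows, int cols,
                          const int8_t *input_m10k,            /* layer input     */
                          volatile const int8_t *weights_m10k, /* shared with HPS */
                          const int16_t *bias,
                          int8_t *output_m10k,
                          volatile int *trigger, volatile const int *ready)
    {
        for (int row = 0; row < rows; row++) {
            *trigger = 1;                      /* request a new row of weights */
            while (!*ready) { }                /* wait for the HPS to load it  */
            *trigger = 0;                      /* State 1: drop trigger        */

            int32_t temp_sum = 0;
            for (int col = 0; col < cols; col++)   /* States 2 & 3: read and   */
                temp_sum += (int32_t)input_m10k[col] * weights_m10k[col]; /* accumulate */

            temp_sum += bias[row];             /* State 4: bias, clip, write   */
            output_m10k[row] =
                (int8_t)(temp_sum < 0 ? 0 : (temp_sum > 127 ? 127 : temp_sum));
        }
        /* row counter at final row: assert "proceed" to the next layer */
    }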

Software Design

Our HPS works alongside our FPGA, continuously sending data back and forth. Specifically, we load the weights and biases in for loops on the HPS, as that is much easier than doing so in the FPGA. Each time the weights and biases are needed to update an M10K, the FPGA alerts the HPS to activate a certain block of code; the HPS sends over the values one row at a time, then sends a signal back to the FPGA letting it know it can begin processing the information.

One of the first things we do in our code is define two fixed-point formats, fix5 and fix13. fix5 represents a 3.5 fixed-point number: 3 bits to the left of the binary point (including the sign bit) and 5 to the right. It can therefore take values from -4 up to (but not including) +4, with 5 fractional bits of precision. fix13 similarly represents a 3.13 fixed-point number that also ranges from -4 to +4, but with 13 fractional bits, which is helpful for finer-grained values. Our first batch of weights, for Linear Layer 1 (256 x 40960), is loaded as fix13 values, since they need 16 bits; the rest use fix5, as they require only 8 bits.
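In C, these conversions are simple scalings by the number of fractional bits (a sketch in the spirit of our HPS code; the macro names are illustrative):

    #include <stdint.h>

    /* 3.5 fixed point (fix5): 8 bits, 5 fractional; range [-4, 4), step 1/32. */
    #define float2fix5(a)  ((int8_t)((a) * 32.0f))
    #define fix2float5(a)  ((float)(a) / 32.0f)

    /* 3.13 fixed point (fix13): 16 bits, 13 fractional; range [-4, 4), step 1/8192. */
    #define float2fix13(a) ((int16_t)((a) * 8192.0f))
    #define fix2float13(a) ((float)(a) / 8192.0f)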

We first initialize the bases and spans for our AXI, lightweight, and heavyweight buses to match the FPGA addresses from Qsys. Then we define offsets for our 5 M10K blocks and 10 PIOs, and create pointers to reference the addresses. Two weight arrays, one for white and one for black, are of size 256 x 40960, each with a corresponding 256-element bias array. The next is of size 32 x 512, with a 32-element bias array; the next layer uses an array of size 32 x 32, also with a 32-element bias array. The final array is a 1D array of size 32 (technically 1 x 32), with a single corresponding bias value. After everything is created properly, we load the arrays with initial values: the two large arrays first, in a nested for loop, as they are larger and hold 16-bit values, and then the other three in a series of nested for loops sized as needed. Finally, we set pio_reset low, then high, then low again to signal the start of the main program.

The system begins on the FPGA by reading the state of the chess board. Once it is ready for the first linear layers, it sends the trigger1 signal to the HPS through a PIO port. The weights are then sent to the FPGA through the shared M10K one row at a time, and at the end of each row we also send that row's bias value through a separate PIO port. Each row is 40960 values long, and there are 256 rows for each of the two M10Ks being loaded, making this the longest part of the HPS process. Throughout loading, we periodically print "begin trigger1" and "Loading row done __" (with the row number just loaded) to the serial terminal, both to update the user on the program's progress and as makeshift delays to ensure everything runs smoothly. At the end of each row, we send the ready1 signal back to the FPGA through a PIO port and wait for the next trigger. After this loops 256 times, we move on to the next stage.
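The structure of this loop is sketched below in C for the white side; the volatile pointers come from mmap()ing the bus spans set up earlier (setup omitted), and the names and handshake details beyond the text (for example, when ready is dropped) are our assumptions:

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch of the layer-1 weight-streaming loop on the HPS. */
    void stream_layer1(volatile int16_t *weights_m10k, volatile int32_t *pio_bias,
                       volatile int *pio_trigger1, volatile int *pio_ready1,
                       int16_t weights[256][40960], const int32_t bias[256])
    {
        for (int row = 0; row < 256; row++) {
            while (*pio_trigger1 == 0) { }        /* wait for the FPGA's request  */
            printf("begin trigger1\n");

            for (int col = 0; col < 40960; col++) /* one row into the shared M10K */
                weights_m10k[col] = weights[row][col];
            *pio_bias = bias[row];                /* this row's bias, via its PIO */

            *pio_ready1 = 1;                      /* row loaded                   */
            printf("Loading row done %d\n", row);
            while (*pio_trigger1 == 1) { }        /* wait for trigger to drop     */
            *pio_ready1 = 0;
        }
    }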

After the two 256-element accumulator outputs are combined into one 512-element vector, we repeat the same process as for the first linear layer. This time, we wait for the trigger2 signal to arrive through the PIO port, then load 32 rows of length 512, along with the appropriate bias values, sending the ready2 signal back to the FPGA each time. After all 32 rows are loaded, we wait for trigger3. For the third linear layer, we load 32 rows of length 32, as well as their bias values, sending ready3 back to the FPGA after each row. Once that process completes, after a small usleep(), we load the final single row of length 32 and the final bias value. We send ready4 back to the FPGA, and print to the serial terminal the final evaluation score, which was calculated on the FPGA and sent over through a PIO port.

Results

To obtain test results, we used plausible board states generated on chess.com, which has a built-in Stockfish evaluator. To evaluate the efficacy of our program, we compared trends between control board states and board states that moderately to heavily favor one side. These board states were loaded using the input indexing logic and initialization FSM described above. Our program does not yet have the pre-trained Stockfish NNUE weights loaded; instead, our input layer loads opposite weights for the two sides to preserve some fidelity, while the remaining layers have equal weights across all dimensions. We will see the validity of this setup, as well as its limitations, as we compare it against the actual Stockfish. It must also be emphasized that we compare trends in the relative score: our program's score will not directly match Stockfish's because of the differing weight values, but similar relationships against the control will demonstrate functionality.

Figure 13: Board starting position (chess.com)

Figure 13 above displays the starting board position as evaluated by Stockfish, shown by the white highlighted number on the right side. It indicates a slight advantage for white, +0.08, even though no moves have been made, since white always has a slight advantage at the start by moving first. Below, Figure 14 shows the output waveform from our test with the same board position loaded into our hardware. Each layer's trigger signal, ready signal, and sum accumulation is observable in the trace; recall that all outputs are written to memory before being set back to zero in the FSM. Our eval score comes out to 0001111110110010, which is in a slightly different fixed-point format from the rest of our system: since we output a 16-bit sum of 8-bit 3.5 fixed-point addends, the result is an 11.5 fixed-point number. In turn, our float output for comparison against Stockfish is 253.8125. With this we have an established control.

Figure 14: ModelSim output waveform
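Converting the 16-bit 11.5 fixed-point result above to a float for comparison is a single scaling by 2^5 (a sketch; the function name is ours):

    #include <stdint.h>

    /* Interpret the 16-bit 11.5 fixed-point evaluation: the low 5 bits are
       fractional, so divide by 2^5 = 32. */
    float eval_to_float(int16_t eval_fixed) {
        return (float)eval_fixed / 32.0f;
    }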

Before moving on to further tests, the M10K contents of each layer's output are shown below to demonstrate the forward propagation.

The next test was done using a board heavily skewed in white's favor, with white on the verge of victory. Our program output an eval score of 0001100001001111, as seen below, or 194.46875 as a float.

The test above uses a board heavily skewed in black's favor. Stockfish outputs an evaluation score of -35.8, the negative sign indicating black's advantage. Our implementation results in an evaluation score of 0001100001001111, as seen below, or 194.46875 as a float.

This test was done on a more nuanced board position, with white in the lead; both sides have similar captures, though their positioning differs. Stockfish's evaluation gives +4.35 in white's favor in this case. Our evaluation returns 0001100001001111, or 194.46875 as a float.

Altogether, comparing our implementation to Stockfish, we exhibit similar trends. Our control state produced a float value of 253.8125, corresponding to Stockfish's +0.08, a nearly even state. The heavily black-skewed state, which was -35.8 in Stockfish, attained an evaluation score of 194.46875; compared to our control, this is a decrease, which matches Stockfish's trend if values under the control represent a favorable position for black. Unfortunately, the flaws of our testing show themselves in the next test: with a more nuanced board position, which evaluates to +4.35 in Stockfish, our implementation is not as sensitive to the nuance. This is due to the rudimentary weights we used in place of pre-trained weights: because all weights are equal, apart from the black and white sides being opposite, the system is more sensitive to piece disparity than to piece position or type. Given that the algorithm itself functions, had pre-trained weights been loaded, the hardware should demonstrate comparable results for nuanced positions such as these, where the capture counts are similar.


Future Steps

Currently, our program uses a preset chess board position, but ideally we would like to be able to input moves to update the board. We began implementing this functionality but did not complete it. Our idea was to have the user type standard chess notation into the serial terminal, such as "Knight to E4", and to parse the input into a format the program understands. We would clear the 1 in the board state at the piece's previous position and set a 1 at its new position; this would have to be updated from both white's and black's perspectives for each move. From there, our program could evaluate the current state of the game.

It would also be interesting to handle special moves, such as castling (both kingside and queenside), pawn promotion, and en passant. These situations would likely require specific case statements to ensure that the appropriate move is conveyed to the system.

Another difference between our system and Stockfish is that we currently only output a numeric evaluation. In the late stages of a chess game, Stockfish can output, for example, M11, meaning there is a forced checkmate in 11 moves. That would require searching to significant depth to explore all possibilities and recognize a checkmate position; our program is unable to detect checkmate, or even checks.

The Stockfish engine does not only evaluate positions, though; it also suggests an optimal next move for the user to play. That would require checking all possible moves in the current position and repeatedly running the NNUE to get an evaluation for each, with the engine then choosing the move with the highest evaluation score to achieve the greatest relative advantage. This would be a very interesting challenge to tackle with hardware acceleration, but unfortunately we did not have enough time to complete all of these desired tasks this semester.

Conclusion

To conclude, our final product fell a little short of our expectations. As seen in the results, with hard-coded weights we were able to get all layers propagating in ModelSim with a final evaluation score that made sense. Moving the design to actual hardware in Quartus proved very difficult, however, because we could not inspect the contents of the M10K memory blocks in the FPGA. Still, we got all layers to propagate from input to output, regardless of whether the final evaluation score made sense. We also did not have time to load pre-trained weights into the HPS, though we do have a mechanism to send hard-coded weights from the HPS to the FPGA; with more time to load the trained weights into the hardware, we could have obtained an evaluation score that closely follows the one seen in Stockfish. This was a particularly challenging project because all three of us are new to machine learning. Another challenge was the sheer number of variables: a total of 81920 inputs, (2 x 40960 x 256) + (32 x 512) + (32 x 32) + 32 = 20,988,960 weights, and synchronizing all the layers and M10K blocks was extremely difficult. This project taught us the complexity of machine learning in hardware and how difficult it is to build hardware accelerators for machine learning like those found in modern GPUs.

There were no applicable standards or intellectual property considerations for our design. The Stockfish engine is completely open source, and its documentation is cited in the references in Appendix C. Other than reusing the signed_mult and M10K block modules from our previous labs, our Verilog design and HPS C program were written entirely by ourselves.

See below for our final demo!

Appendices

Appendix A: Permissions

Appendix B: Program Code